Abstract

Identifiability of evolutionary tree models has been a recent topic of
discussion, and some models have been shown to be non-identifiable. A
coalescent-based rooted population tree model, originally proposed by
Nielsen et al. 1998 [2], has been used by many authors in the last few
years and is a simple tool to accurately model the changes in allele
frequencies in the tree. However, the identifiability of this model has
never been proven. Here we prove this model to be identifiable by showing
that the model parameters can be expressed as functions of the probability
distributions of subsamples. This is a step toward proving the consistency
of the maximum likelihood estimator of the population tree based on this
model.
1 Introduction

A rooted evolutionary tree is a directed weighted tree graph; it represents the
evolutionary relationship between groups (also called taxa) of organisms (Figure
1(a)). A leaf or a tip is a node with degree 1; each tip represents a modern-day
taxon. The root (node 0) represents the most recent common ancestor
(MRCA) of all the taxa. The direction (of evolution) is from the root to the
tips. An evolutionary tree, viewed as a vector of parameters, influences the
probability distribution of alleles at the tips.

A rooted population tree is a rooted evolutionary tree where the taxa are 
populations from the same species. Two types of parameters are common in any 
model of the rooted population tree: the tree-topology parameter (a categorical 
parameter) for the whole tree, and a branch parameter for each branch (also 
called edge). 

The tree-topology is the order in which the path from the root separates for 
the given set of populations; it is represented as a directed tree graph without 
the weight. (In Figure 1(a) and (b), the two trees have different tree-topologies 
for the populations 1-4.) A branch parameter is usually a branch-length (an 
edge-weight) or a transition probability matrix that influences the change in 
allele frequency between the two nodes of a branch. 
Figure 1: Population trees



Here we will prove the identifiability of a population tree model by [2, 5]
that uses Kingman's Coalescent Process ([3]). The model was later modified
and expanded by various authors ( [H O [SI [Zl H] ). Coalescent-based models are
of significant importance as they model the underlying allele frequency changes
with accuracy and relative ease (see [13]).

Due to the underlying structure of evolutionary tree-based models, their
identifiability is never obvious. The identifiability of certain evolutionary
tree models has been a recent topic of discussion. [1] proved the
identifiability of a general time reversible (GTR) transition probability
matrix-based model. Non-identifiability of another time reversible model was
established in [10]. The non-identifiability of mixture models has been
discussed in [11]. The identifiability of the [12] model has been proven by
[9]. To our knowledge the identifiability of the coalescent-based model of
[2, 5] has never been proven.

For estimating evolutionary trees each independent genetic locus is viewed as 
a single data-point, as opposed to viewing each individual as a data-point (see, 
for example [8]). Thus, identifiability would mean that the model parameters
can be identified from the distribution of allele-types for a set of individuals at 
a single genetic locus. 

2 The model 

In this section we will describe the underlying model of [2, 5]. We start by
defining our notation (see also Figure 1(c)). We define a P-tip population tree
as $T = (\Lambda^{(P)}, \Psi, \theta)$. The parameter $\Lambda^{(P)}$ is the
tree-topology, an unweighted directed tree-graph; it takes finitely many
discrete categorical values; the (P) in superscript denotes the number of tips.
The parameter $\Psi = (\tau_1, \tau_2, \ldots, \tau_{2P-2})$ is a vector of
length $2P-2$ consisting of the branch-lengths $\tau_i$ for each branch $i$ in
$\Lambda^{(P)}$. A strictly bifurcating tree-topology has exactly $2P-2$
branches. If $\Lambda^{(P)}$ is non-bifurcating then it has fewer branches and
the remaining elements of $\Psi$ are populated by zeros. The parameter $\theta$
is a vector containing the parameters of the root distribution, which we will
define later in this section. We also define $S(\Lambda^{(P)})$ as the set of
tips of $\Lambda^{(P)}$.

At each tip $z$ there are $n_z$ ($\geq 1$) lineages, each having allele-type
'0' or '1'. The allele types among these lineages at each tip are the
observable random variables. Similarly, at each non-tip node $x$, the random
variable $n_x$ ($\geq 1$) is the (random) number of lineages that are ancestral
to the tips below $x$ along the tree. We also define the random variable $r_x$
at each node $x$ (tip or non-tip) as the count of allele '1' among the $n_x$
lineages. From now on we will use the term 'allele-count' to refer to the count
of allele '1'. For each tip $z$, the allele-count $r_z$ is observable.

Consider a branch with lower (towards the tips) node $x$ and upper (towards
the root) node $y$. Let $n'_x$ be the number of lineages in $y$ that are
ancestral to the $n_x$ lineages at $x$ ($n'_x \leq n_x$). Also, let $r'_x$ be
the allele-count among these $n'_x$ lineages ($r'_x \leq r_x$). If $y$ is the
upper node of $\nu$ branches with lower nodes $x_1, x_2, \ldots, x_\nu$, then

\[
n'_{x_1}, n'_{x_2}, \ldots, n'_{x_\nu} \text{ are independent, and } n_y = \sum_{k=1}^{\nu} n'_{x_k}, \tag{1}
\]

and also $r_y = \sum_{k=1}^{\nu} r'_{x_k}$. (For a strictly bifurcating tree $\nu = 2$.)
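Because Eq. (1) writes the lineage count at the upper node as a sum of independent counts, its pmf is the convolution of the pmfs of the summands. The following is a minimal sketch of that step; the function name `pmf_of_sum` and the numerical pmfs are hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def pmf_of_sum(pmfs: Sequence[Dict[int, float]]) -> Dict[int, float]:
    """Return the pmf of a sum of independent integer-valued variables.

    Each input pmf maps a value to its probability. The result is the
    repeated discrete convolution of the inputs.
    """
    total: Dict[int, float] = {0: 1.0}
    for pmf in pmfs:
        nxt: Dict[int, float] = {}
        for s, ps in total.items():
            for v, pv in pmf.items():
                nxt[s + v] = nxt.get(s + v, 0.0) + ps * pv
        total = nxt
    return total


# Hypothetical pmfs of n'_{x_1} and n'_{x_2} at a bifurcating node y.
pmf_x1 = {1: 0.4, 2: 0.6}
pmf_x2 = {1: 1.0}
pmf_ny = pmf_of_sum([pmf_x1, pmf_x2])  # pmf of n_y = n'_{x_1} + n'_{x_2}
```

The same routine applies to a non-bifurcating node by passing more than two pmfs.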

From the model parameters $T = (\Lambda^{(P)}, \Psi, \theta)$ one computes the
probability of the observed vector of allele-counts $r = (r_1, r_2, \ldots, r_P)$
from samples of sizes $n = (n_1, n_2, \ldots, n_P)$ at the $P$ tips
$(1, 2, \ldots, P)$ as follows. Consider a branch with length $\tau_{x_1}$, with
upper node $y$ and lower node $x_1$. Given the probability mass function (pmf)
of $n_{x_1}$ (the number of lineages at $x_1$), the pmf of $n'_{x_1}$ is computed
as

\[
\Pr_n\big(n'_{x_1} = i' \mid n_{x_1} = i;\ \tau_{x_1}\big)
= \left( \prod_{j=i'+1}^{i} \lambda_j \right) \sum_{k=i'}^{i} \frac{e^{-\lambda_k \tau_{x_1}}}{\prod_{l=i',\, l \neq k}^{i} (\lambda_l - \lambda_k)}, \tag{2}
\]

where $\lambda_j = j(j-1)/2$. Then, the pmf of $n_y$ is determined from Eq. (1).
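The lineage-count transition in Eq. (2) can be evaluated numerically. The sketch below assumes the standard closed form for a pure-death process with distinct rates $\lambda_j = j(j-1)/2$, that is, a product of rates times a sum of exponentials divided by rate differences, with branch length measured in coalescent units. The function names are hypothetical.

```python
import math


def coal_rate(j: int) -> float:
    """Coalescence rate lambda_j = j(j-1)/2 for j lineages."""
    return j * (j - 1) / 2.0


def lineage_pmf(i_top: int, i_bottom: int, tau: float) -> float:
    """Pr(n' = i_top | n = i_bottom; tau) for a branch of length tau.

    Assumes the pure-death closed form with distinct rates; the rates
    lambda_1, ..., lambda_i are distinct, so no denominator vanishes.
    """
    if not 1 <= i_top <= i_bottom:
        return 0.0
    prefactor = math.prod(coal_rate(j) for j in range(i_top + 1, i_bottom + 1))
    total = 0.0
    for k in range(i_top, i_bottom + 1):
        denom = math.prod(
            coal_rate(l) - coal_rate(k)
            for l in range(i_top, i_bottom + 1)
            if l != k
        )
        total += math.exp(-coal_rate(k) * tau) / denom
    return prefactor * total


# With two lineages, both survive to the top with probability exp(-tau).
p_two_survive = lineage_pmf(2, 2, 0.7)
p_two_merge = lineage_pmf(1, 2, 0.7)
```

The alternating sum can lose precision for large lineage counts, so this direct evaluation is only intended for small samples such as the subsamples of size one and two used in Section 3.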

Using Eqs. (2) and (1), starting from $n = (n_1, n_2, \ldots, n_P)$ and going
upward, one computes the pmf of $n_z$ and $n'_z$ for any non-tip non-root node
$z$, and finally $n_0$ at the root (node 0). Then a 'root distribution' with
parameter $\theta$ gives the pmf of the allele-count $r_0$ given $n_0$ at the
root:

\[
G(\theta) = \big( \Pr_n(r_0 = j \mid n_0 = i;\ \theta),\ j = 0, 1, \ldots, n_0;\ i = 1, 2, \ldots, m \big),
\]

where

\[
m = \sum_{s \text{ is a tip}} n_s
\]

is the maximum possible value of $n_0$ (number of lineages at the root).
Different authors have used different root distributions. In particular [5]
used the symmetric Beta-Binomial distribution:

\[
\Pr_n(r_0 = j \mid n_0 = i;\ \theta) = \binom{i}{j} \frac{\beta(j + \theta,\ i - j + \theta)}{\beta(\theta, \theta)}, \tag{3}
\]

where $\beta(\cdot, \cdot)$ is the Beta function; $\theta > 0$ is a parameter to be estimated.
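As an illustration of the root distribution in Eq. (3), the sketch below evaluates the standard symmetric Beta-Binomial pmf, assuming the usual form $\binom{i}{j}\,\beta(j+\theta, i-j+\theta)/\beta(\theta,\theta)$. The function names are hypothetical, and log-gamma is used to avoid overflow.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def root_pmf(j: int, i: int, theta: float) -> float:
    """Symmetric Beta-Binomial Pr(r_0 = j | n_0 = i; theta), theta > 0."""
    if not 0 <= j <= i:
        return 0.0
    log_ratio = log_beta(j + theta, i - j + theta) - log_beta(theta, theta)
    return math.comb(i, j) * math.exp(log_ratio)


# For theta = 1 the distribution is uniform on {0, ..., i}.
uniform_case = [root_pmf(j, 3, 1.0) for j in range(4)]
```

For $n_0 = 2$ this form gives $\Pr(r_0 = 1) = \theta/(2\theta+1)$, which is strictly increasing in $\theta$.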

Then, from the distribution of $n_0$, $r_0$ and $\{n_z, n'_z\}$ for all non-root
nodes $z$, we compute the distribution of $r_x$ (allele-counts) at the rest of
the nodes as follows. Consider a node $y$ where $\nu$ branches merge from the
bottom with the bottom nodes $x_1, x_2, \ldots, x_\nu$. Recall that we already
have the distributions of $n_y$, $n_{x_i}$ and $n'_{x_i}$, $i = 1, 2, \ldots, \nu$.
The pmf of $r'_{x_i}$ is computed from the pmf of $r_y$ using the formula
(4)



Then the pmf of $r_{x_k}$ is computed from the above pmf using the following
(from an expression in [5]):

\[
\Pr_n\big(r_{x_k} = j_k \mid r'_{x_k} = j'_k,\ n'_{x_k} = i'_k,\ n_{x_k} = i_k\big)
= \begin{cases}
\dfrac{\beta(j_k,\ i_k - j_k)}{\beta(j'_k,\ i'_k - j'_k)} \dbinom{i_k - i'_k}{j_k - j'_k}, & 0 < j_k < i_k \text{ and } 0 < j'_k < i'_k, \\[2ex]
1, & 0 = j_k = j'_k \text{ or } 0 = i_k - j_k = i'_k - j'_k, \\[1ex]
0, & \text{otherwise;}
\end{cases} \tag{5}
\]

$k = 1, 2, \ldots, \nu$ ([5]). Thus, starting with $G(\theta)$ at the root, one
computes the joint pmf of $(r_1, r_2, \ldots, r_P)$ from the formulae in Eqs.
(4) and (5). Note that in Eqs. (1), (2), (3), (4) and (5) probability 'flows'
up along the $n$'s and then flows down along the $r$'s.
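The downward step of Eq. (5) can be sketched as follows, reading the equation as a ratio of Beta functions $\beta(j, i-j)/\beta(j', i'-j')$ times a binomial coefficient $\binom{i-i'}{j-j'}$ in the interior case, with the two boundary cases handled separately. This reading coincides with a Pólya-urn scheme in which each new lineage copies the allele of a uniformly chosen existing lineage. The function names are hypothetical.

```python
import math


def log_beta(a: float, b: float) -> float:
    """Logarithm of the Beta function via log-gamma."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def descend_pmf(j: int, j_top: int, i_top: int, i: int) -> float:
    """Pr(r_x = j | r'_x = j_top, n'_x = i_top, n_x = i).

    j_top of the i_top ancestral lineages carry allele '1'; the result
    is the probability that j of the i descendant lineages carry it.
    """
    if not (0 <= j_top <= i_top <= i and 0 <= j <= i):
        return 0.0
    if j_top == 0:                 # no '1' allele above, none below
        return 1.0 if j == 0 else 0.0
    if j_top == i_top:             # only '1' alleles above, only '1' below
        return 1.0 if j == i else 0.0
    extra_ones = j - j_top
    extra_zeros = (i - j) - (i_top - j_top)
    if extra_ones < 0 or extra_zeros < 0:
        return 0.0
    log_ratio = log_beta(j, i - j) - log_beta(j_top, i_top - j_top)
    return math.comb(i - i_top, extra_ones) * math.exp(log_ratio)


# One '1' among two ancestors, five descendants: uniform on {1, 2, 3, 4}.
urn_case = [descend_pmf(j, 1, 2, 5) for j in range(6)]
```

When the number of lineages does not change along the branch ($i = i'$), the function returns 1 for $j = j'$ and 0 otherwise, which is the fact used repeatedly in Section 3 for subsamples of size one.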

Now that we have completely described the model, we will proceed to prove 
the identifiability of this model in the next section. 

3 Identifiability 

Let $T = (\Lambda^{(P)}, \Psi, \theta)$ be a tree with
$S(\Lambda^{(P)}) = \{1, 2, \ldots, P\}$. We define a subtree $T^*$ of $T$ as a
tree formed by a subset $S^*$ (cardinality $P' \leq P$) of $S(\Lambda^{(P)})$ by
tracking the tips in $S^*$ along the tree to their most recent common ancestor
(MRCA) node. Thus, $T^* = (\Lambda^{(P')*}, \Psi^*, \theta)$, where
$\Lambda^{(P')*}$ is the tree-topology with the $P'$ tips of $S^*$. For example,
in Figure 1(a), $P = 4$, $S(\Lambda^{(4)}) = \{1, 2, 3, 4\}$, $S^* = \{3, 4\}$
and the subtree $T^*$ is drawn with the dotted lines.

Consider two distinct trees $T_1 = (\Lambda_1^{(P)}, \Psi_1, \theta_1)$ and
$T_2 = (\Lambda_2^{(P)}, \Psi_2, \theta_2)$ with a common set of tips
$S_{T_{1,2}} = S(\Lambda_1^{(P)}) = S(\Lambda_2^{(P)})$.

If $\theta_1 = \theta_2 = \theta$, then there must be at least one doubleton
subset $\{z_1, z_2\} \subset S_{T_{1,2}}$ with the following property: the
subtrees $T_1^* = (\Lambda^{(2)}, \Psi_1^*, \theta)$ and
$T_2^* = (\Lambda^{(2)}, \Psi_2^*, \theta)$, formed by tracking $z_1$ and $z_2$
to the root in $T_1$ and $T_2$ (respectively), are distinct. That is, if
$\Psi_l^* = (\tau_{1l}, \tau_{2l})$ and $\tau_{jl}$ is the path distance (total
branch length) between $z_j$ and the MRCA of $z_1$ and $z_2$ along the subtree
$T_l^*$ ($j, l = 1, 2$), then $(\tau_{11}, \tau_{21}) \neq (\tau_{12}, \tau_{22})$.
(Note that there is only one possible tree-topology for a two-tip tree, denoted
as $\Lambda^{(2)}$ above.) Thus, the set of all two-tip subtrees, along with
$\theta$, uniquely identifies the tree.

We assign the two-tip subtrees into two categories: Type-I subtrees are those 
with the root as the MRCA of the two tips. For example in Figure 1(a), the 
subtree formed by tips {3, 4} has the root as the MRCA of the two tips 3 and 
4. Thus, it is of Type-I. All other two-tip subtrees are Type-II subtrees. For 
example, in Figure 1(a), if a subtree is formed by tips 2 and 4, it will be a 
Type-II subtree as their MRCA is node 6, and not the root. We will deal with 
these two types of subtrees separately. 

We note that the root distribution of [5] (Eq. (3)) is identifiable as it is
Beta-Binomial. Next, we will prove the identifiability of the whole model by
assuming a general identifiable root distribution that has parameter vector
$\theta$. (In particular, our proof would work with Beta-Binomial as the root
distribution.)






Theorem Suppose that we have a tree $T$ with the underlying model as
described in Section 2. Also, suppose that we have $N_k \geq 2$ lineages
sampled at each tip $k$ and the root distribution is identifiable. Then the
parameters of $T$ are identifiable from the distribution of allele types at the
tips.

To prove the above theorem, we will show that the parameters of each two-tip
subtree can be expressed as a function of the joint pmf

\[
\Big( \Pr_n\big( (R_1, R_2, \ldots, R_P) = (J_1, J_2, \ldots, J_P) \mid T \big),\ J_k = 0, 1, 2, \ldots, N_k,\ k = 1, 2, \ldots, P \Big).
\]

This will complete the proof as the set of all two-tip subtrees, along with
$\theta$, uniquely identifies the tree.

3.1 Identifiability of Type-I subtrees 

Suppose that $T = (\Lambda^{(2)}, \{\tau_1, \tau_2\}, \theta)$ is a Type-I
subtree with the underlying model as described in Section 2. Let $z_1$ and
$z_2$ be its two tips. Let the root be denoted as '0' (Figure 1(d)) and let
$\tau_k$ be the path distance between $z_k$ and the root ($k = 1, 2$).

Proposition 3.1 Suppose that we have at least two lineages sampled at each
of $z_1$ and $z_2$ and the root distribution is identifiable. Then $\tau_1$,
$\tau_2$ and $\theta$ can be expressed as functions of the joint pmf of allele
types in $z_1$ and $z_2$, and hence they are identifiable.

Proof Suppose that we have samples of $N_1$ and $N_2$ lineages from $z_1$ and
$z_2$ respectively, and the allele-counts among these lineages are $R_1$ and
$R_2$ respectively. Let the joint pmf of $(R_1, R_2)$ be $f_{N,R}$.

Consider random subsamples (without replacement) of size $n_1$ and $n_2$ from
$z_1$ and $z_2$ respectively with $n_k \leq 2$, $k = 1, 2$. Rather than working
with the allele-counts $R_k$ at the original samples, we will work with the
allele-counts $r_k$ at the subsamples.

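One computes the joint pmf of $(r_1, r_2)$ from $f_{N,R}$ as

\[
\Pr_n\big(r_k = j_k,\ k = 1, 2 \mid N_k = I_k,\ n_k = i_k,\ k = 1, 2;\ \tau_1, \tau_2, \theta\big)
= \sum_{J_1 = j_1}^{I_1 - (i_1 - j_1)} \ \sum_{J_2 = j_2}^{I_2 - (i_2 - j_2)} \left( \prod_{k=1}^{2} \frac{\binom{J_k}{j_k} \binom{I_k - J_k}{i_k - j_k}}{\binom{I_k}{i_k}} \right) f_{N,R}(J_1, J_2), \tag{6}
\]

where each factor in the product is the hypergeometric probability
$\Pr(r_k = j_k \mid R_k = J_k, N_k = I_k, n_k = i_k)$ of the subsample count.
We will argue that the joint pmfs of $(r_1, r_2)$ for $(n_1, n_2) = (1,1)$,
$(1,2)$ and $(2,1)$ are enough to identify the parameters $\tau_1$, $\tau_2$
and $\theta$.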
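The subsampling step can be sketched numerically: the pmf of the subsample allele-counts is obtained by weighting the full-sample pmf with hypergeometric probabilities, one factor per tip. The function names and the uniform full-sample pmf below are hypothetical, used only to exercise the computation.

```python
import math
from typing import Dict, Tuple


def hypergeom(j: int, i: int, J: int, I: int) -> float:
    """Pr(j ones in a subsample of size i | J ones among I lineages)."""
    if j < 0 or j > i or j > J or i - j > I - J:
        return 0.0
    return math.comb(J, j) * math.comb(I - J, i - j) / math.comb(I, i)


def subsample_pmf(
    f: Dict[Tuple[int, int], float],
    sizes_full: Tuple[int, int],
    sizes_sub: Tuple[int, int],
) -> Dict[Tuple[int, int], float]:
    """Joint pmf of (r_1, r_2) for subsamples drawn without replacement."""
    (I1, I2), (i1, i2) = sizes_full, sizes_sub
    out: Dict[Tuple[int, int], float] = {}
    for j1 in range(i1 + 1):
        for j2 in range(i2 + 1):
            out[(j1, j2)] = sum(
                hypergeom(j1, i1, J1, I1) * hypergeom(j2, i2, J2, I2) * p
                for (J1, J2), p in f.items()
            )
    return out


# Hypothetical full-sample pmf: uniform over (J_1, J_2) with N = (2, 2).
f_full = {(J1, J2): 1.0 / 9.0 for J1 in range(3) for J2 in range(3)}
sub_11 = subsample_pmf(f_full, (2, 2), (1, 1))
sub_21 = subsample_pmf(f_full, (2, 2), (2, 1))
```

Since the weights depend only on the sample sizes, every subsample pmf used in the proof is a known linear function of the full-sample pmf.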

As before, let $n'_k$ be the number of lineages ancestral to the subsample at
$z_k$ that are present at the top node (the root) (see Figure 1(d)) and $r'_k$
be the allele-count out of these $n'_k$ ($k = 1, 2$). Also, let
$n_0 = n'_1 + n'_2$ be the number of lineages at the root ancestral to the
subsampled lineages at $z_1$ and $z_2$, and $r_0 = r'_1 + r'_2$ be the
allele-count out of these $n_0$ lineages.

First, consider the case $n_1 = n_2 = 1$. Then $r_k = 0$ or $1$ for $k = 1, 2$.
From Eq. (2) it follows that $n'_1 = n'_2 = 1$; thus,
$\Pr_n(n'_k = i' \mid n_k = i;\ \tau_k)$ and hence
$\Pr(r_1 = j_1, r_2 = j_2 \mid n_1 = n_2 = 1;\ \tau_1, \tau_2, \theta)$ do not
involve $\tau_1$ and $\tau_2$. From Eq. (5) it also follows that $r_k = r'_k$,
$k = 1, 2$. Also, $n_0 = n'_1 + n'_2 = 1 + 1 = 2$. Note that
$r_0 = r'_1 + r'_2$ and $r_k = r'_k$ ($k = 1, 2$) are counts. Thus,

\[
(r_1, r_2) = (0, 0) \iff (r'_1, r'_2) = (0, 0) \iff r_0 = 0.
\]

Using a symmetric argument,

\[
(r_1, r_2) = (1, 1) \iff (r'_1, r'_2) = (1, 1) \iff r_0 = 2.
\]

Thus,

\[
\Pr\big((r_1, r_2) = (j, j) \mid n_1 = n_2 = 1;\ \theta\big) = \Pr(r_0 = 2j \mid n_0 = 2;\ \theta), \quad j = 0, 1. \tag{7}
\]

It follows that

\[
\begin{aligned}
&\Pr\big((r_1, r_2) = (0, 1) \mid n_1 = n_2 = 1;\ \theta\big) + \Pr\big((r_1, r_2) = (1, 0) \mid n_1 = n_2 = 1;\ \theta\big) \\
&\quad = 1 - \Pr\big((r_1, r_2) = (0, 0) \mid n_1 = n_2 = 1;\ \theta\big) - \Pr\big((r_1, r_2) = (1, 1) \mid n_1 = n_2 = 1;\ \theta\big) \\
&\quad = 1 - \Pr(r_0 = 0 \mid n_0 = 2;\ \theta) - \Pr(r_0 = 2 \mid n_0 = 2;\ \theta) \\
&\quad = \Pr(r_0 = 1 \mid n_0 = 2;\ \theta).
\end{aligned} \tag{8}
\]

Thus, from Eqs. (7) and (8), $\Pr(r_0 = j_0 \mid n_0 = 2;\ \theta)$,
$j_0 = 0, 1, 2$, can be expressed as functions of
$\Pr\big((r_1, r_2) = (j_1, j_2) \mid n_1 = n_2 = 1;\ \theta\big)$,
$j_1, j_2 = 0, 1$. The former is the root distribution for $n_0 = 2$, which is
identifiable by the condition of Proposition 3.1. Thus, $\theta$ can be
expressed as a function of the pmf of $r_0$ (given $n_0 = 2$), and thus as a
function of the joint pmf of $(r_1, r_2)$. Hence, it can also be expressed as a
function of $f_{N,R}$.
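The recovery of the root distribution for $n_0 = 2$, and then of $\theta$, can be made concrete under one specific assumption: that the root distribution is the symmetric Beta-Binomial of Eq. (3). In that case $\Pr(r_0 = 1 \mid n_0 = 2; \theta) = \theta/(2\theta + 1)$, which inverts in closed form. The sketch below is illustrative only; the function names and the input table are hypothetical, and a different identifiable root distribution would need its own inversion.

```python
from typing import Dict, Tuple


def root_pmf_n0_2(sub_11: Dict[Tuple[int, int], float]) -> Dict[int, float]:
    """Root pmf for n_0 = 2 from the (1,1)-subsample pmf, per Eqs. (7)-(8)."""
    return {
        0: sub_11[(0, 0)],
        1: sub_11[(0, 1)] + sub_11[(1, 0)],
        2: sub_11[(1, 1)],
    }


def theta_from_c2(c2: float) -> float:
    """Invert c2 = theta / (2 theta + 1).

    Assumes a symmetric Beta-Binomial root, for which 0 < c2 < 1/2.
    """
    if not 0.0 < c2 < 0.5:
        raise ValueError("c2 must lie strictly between 0 and 1/2")
    return c2 / (1.0 - 2.0 * c2)


# Hypothetical (1,1)-subsample pmf generated with theta = 1.5.
sub_11 = {(0, 0): 0.3125, (0, 1): 0.1875, (1, 0): 0.1875, (1, 1): 0.3125}
root2 = root_pmf_n0_2(sub_11)
theta_hat = theta_from_c2(root2[1])
```

Here the mixed outcomes (0,1) and (1,0) carry all the information about $\theta$, because the two concordant outcomes are equal by symmetry.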

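Next, we consider $n_1 = 2$, $n_2 = 1$. Then $r_1 = 0$, $1$ or $2$ and
$r_2 = 0$ or $1$. From Eq. (2) it follows that $n'_2 = 1$; thus
$\Pr_n(n'_2 = i'_2 \mid n_2 = i_2;\ \tau_2)$ and hence
$\Pr\big((r_1, r_2) = (0, 1) \mid (n_1, n_2) = (2, 1);\ \tau_1, \tau_2, \theta\big)$
do not involve $\tau_2$. Moreover, $n_0 = n'_1 + n'_2 = n'_1 + 1$. Also, from
Eq. (5) it follows that

\[
(r_1, r_2) = (0, 1) \iff (r'_1, r'_2) = (0, 1).
\]

Thus,

\[
\begin{aligned}
&\Pr\big((r_1, r_2) = (0, 1) \mid (n_1, n_2) = (2, 1);\ \tau_1, \theta\big) \\
&\quad = \sum_{i'=1}^{2} \Pr\big((r'_1, r'_2) = (0, 1) \mid (n'_1, n'_2) = (i', 1);\ \theta\big) \Pr(n'_1 = i' \mid n_1 = 2;\ \tau_1) \\
&\quad = \sum_{i'=1}^{2} \Pr\big((r'_1, r'_2) = (0, 1) \mid n_0 = n'_1 + 1 = i' + 1;\ \theta\big) \Pr(n'_1 = i' \mid n_1 = 2;\ \tau_1) \\
&\quad = \sum_{i'=1}^{2} \sum_{j_0=0}^{i'+1} \Pr\big((r'_1, r'_2) = (0, 1) \mid r_0 = j_0, n_0 = i' + 1\big) \\
&\qquad\qquad \times \Pr(r_0 = j_0 \mid n_0 = i' + 1;\ \theta) \Pr(n'_1 = i' \mid n_1 = 2;\ \tau_1).
\end{aligned}
\]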
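Note that $r_0 \neq 1 \implies (r'_1, r'_2) \neq (0, 1)$. Also, note that

\[
\Pr(r_0 = 1 \mid n_0 = i' + 1;\ \theta)
\]

is a function of $\theta$ only (and no other parameters); hence we call it
$c_{i'+1}(\theta)$, $i' = 1, 2$. Thus,

\[
\begin{aligned}
&\Pr\big((r_1, r_2) = (0, 1) \mid (n_1, n_2) = (2, 1);\ \tau_1, \theta\big) \\
&\quad = \sum_{i'=1}^{2} \Pr\big((r'_1, r'_2) = (0, 1) \mid r_0 = 1, n_0 = i' + 1\big)\, c_{i'+1}(\theta)\, \Pr(n'_1 = i' \mid n_1 = 2;\ \tau_1) \\
&\quad = \frac{c_2(\theta)}{2} \big(1 - e^{-\tau_1}\big) + \frac{c_3(\theta)}{3}\, e^{-\tau_1} \\
&\quad = e^{-\tau_1} \left( \frac{c_3(\theta)}{3} - \frac{c_2(\theta)}{2} \right) + \frac{c_2(\theta)}{2}
\end{aligned}
\]

from Eqs. (2) and (4). From the above equation it follows that

\[
\tau_1 = h\Big( \Pr\big((r_1, r_2) = (0, 1) \mid (n_1, n_2) = (2, 1);\ \tau_1, \theta\big),\ \theta \Big) \tag{9}
\]

for some function $h(\cdot, \cdot)$. We have already established that $\theta$
can be expressed as a function of $f_{N,R}$. Thus, $\tau_1$ can be expressed as
a function of $f_{N,R}$ and hence $\tau_1$ is identifiable.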
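The function in Eq. (9) can be written out explicitly in one special case. The sketch below assumes the symmetric Beta-Binomial root of Eq. (3), for which $c_2(\theta) = \theta/(2\theta+1)$ and $c_3(\theta) = 3\theta/(4(2\theta+1))$, and uses the relation $p = e^{-\tau_1}\,(c_3/3 - c_2/2) + c_2/2$ for $p = \Pr((r_1, r_2) = (0,1) \mid (n_1, n_2) = (2,1))$, with $\tau_1$ in coalescent units. The function names are hypothetical, and the inversion needs $c_3/3 \neq c_2/2$, which holds under this assumption.

```python
import math


def c2(theta: float) -> float:
    """Pr(r_0 = 1 | n_0 = 2) under a symmetric Beta-Binomial root."""
    return theta / (2.0 * theta + 1.0)


def c3(theta: float) -> float:
    """Pr(r_0 = 1 | n_0 = 3) under a symmetric Beta-Binomial root."""
    return 3.0 * theta / (4.0 * (2.0 * theta + 1.0))


def prob_01(tau1: float, theta: float) -> float:
    """Forward map: Pr((r_1, r_2) = (0, 1) | (n_1, n_2) = (2, 1))."""
    keep = math.exp(-tau1)  # the two lineages from z_1 do not coalesce
    return keep * c3(theta) / 3.0 + (1.0 - keep) * c2(theta) / 2.0


def tau1_from_prob(p: float, theta: float) -> float:
    """Inverse map: recover tau_1 from p and theta."""
    slope = c3(theta) / 3.0 - c2(theta) / 2.0
    ratio = (p - c2(theta) / 2.0) / slope
    if not 0.0 < ratio <= 1.0:
        raise ValueError("p is not attainable for this theta")
    return -math.log(ratio)


# Round trip with hypothetical parameter values.
p = prob_01(0.8, 1.5)
tau1_hat = tau1_from_prob(p, 1.5)
```

The forward map is strictly monotone in $\tau_1$ whenever the slope term is nonzero, which is what makes the inverse well defined.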

Using a symmetric argument, one can establish that $\tau_2$ can be expressed as
a function of $f_{N,R}$ and hence it is identifiable. Thus, this proposition is
proven.

3.2 Identifiability of Type-II subtrees 

Consider a Type-II subtree of $T$ with tips $z_A$ and $z_B$. Let the MRCA node
of $z_A$ and $z_B$ be denoted as $z_{AB}$. (By definition $z_{AB}$ is not the
root.) Also, consider the path from $z_{AB}$ to the root (node 0) and call it
branch AB. There must be at least one other branch H attached to the root
besides branch AB (Figure 1(e)). Consider a tip $z_D$ such that the path
between $z_D$ and the root goes through H. Let $\tau_A$ be the path distance
between $z_{AB}$ and $z_A$ and let $\tau_B$ be the path distance between
$z_{AB}$ and $z_B$. Also, let $\tau_{AB}$ be the path distance between the root
and $z_{AB}$ and let $\tau_D$ be the path distance between the root and $z_D$.

Proposition Suppose that we have at least two haploids sampled at each
of $z_A$, $z_B$ and $z_D$ and the root distribution is identifiable. Then
$\tau_A$, $\tau_B$, $\tau_{AB}$, $\tau_D$ and $\theta$ can be expressed as
functions of the joint pmf of the allele types at $z_A$, $z_B$ and $z_D$, and
hence they are identifiable.

Proof Suppose that we have samples of $N_A$, $N_B$ and $N_D$ lineages from
$z_A$, $z_B$ and $z_D$ respectively, and the allele-counts among these lineages
are $R_A$, $R_B$ and $R_D$ respectively. Let the joint pmf of
$(R_A, R_B, R_D)$ be $f_{N,R}$.

First we consider the Type-I subtree formed by $z_A$ and $z_D$. From
Proposition 3.1 one can establish that $\theta$, $\tau_D$ and
$\tau_A + \tau_{AB}$ can be expressed as functions of the joint pmf of
$(R_A, R_D)$ and hence of $f_{N,R}$. A symmetric argument also establishes that
$\tau_B + \tau_{AB}$ can be expressed as a function of $f_{N,R}$. Next we will
show that each of $\tau_A$, $\tau_B$ and $\tau_{AB}$ can be expressed as a
function of $f_{N,R}$.



Consider a random subsample of size one from each of $z_A$, $z_B$ and $z_D$.
Let $n_A$, $n_B$ and $n_D$ be the numbers of subsampled haploids at $z_A$,
$z_B$ and $z_D$, respectively. (Thus, $n_A = n_B = n_D = 1$.) Let $r_A$, $r_B$
and $r_D$, respectively, be the observed allele-counts at these subsamples.
($r_k = 0$ or $1$ for $k = A, B, D$.) As before, let $n'_k$ be the number of
lineages ancestral to the subsample at $z_k$ that are present at the top node
of the branch (in the subtree) attached to $z_k$ (see Figure 1(e)) and $r'_k$
be the allele-count out of these $n'_k$ ($k = A, B, D$).

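From Eq. (2) it follows that $n_A = n_B = n_D = n'_A = n'_B = n'_D = 1$ and
thus $\Pr(n'_k = i'_k \mid n_k = i_k;\ \tau_k)$ does not involve $\tau_k$
($k = A, B, D$). Hence,

\[
\Pr\big((r_A, r_B, r_D) = (0, 0, 1) \mid n_A = n_B = n_D = 1;\ \tau_A, \tau_B, \tau_{AB}, \tau_D, \theta\big)
\]

does not involve $\tau_A$, $\tau_B$ and $\tau_D$. Also,

\[
\begin{aligned}
&\Pr\big((r_A, r_B, r_D) = (0, 0, 1) \mid n_A = n_B = n_D = 1;\ \tau_{AB}, \theta\big) \\
&\quad = \sum_{J_A = j_A}^{I_A - (i_A - j_A)} \ \sum_{J_B = j_B}^{I_B - (i_B - j_B)} \ \sum_{J_D = j_D}^{I_D - (i_D - j_D)} \left( \prod_{k \in \{A, B, D\}} \frac{\binom{J_k}{j_k} \binom{I_k - J_k}{i_k - j_k}}{\binom{I_k}{i_k}} \right) f_{N,R}(J_A, J_B, J_D),
\end{aligned} \tag{10}
\]

with $N_k = I_k$, $(i_A, i_B, i_D) = (1, 1, 1)$ and $(j_A, j_B, j_D) = (0, 0, 1)$.
Thus, the left side of Eq. (10) can be expressed as a function of $f_{N,R}$. It
also follows from Eq. (5) that $r_k = r'_k$, $k = A, B, D$.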

Let $n_{AB} = n'_A + n'_B$ be the total number of lineages from the subsamples
of $z_A$ and $z_B$ that are present at node $z_{AB}$, and let
$r_{AB} = r'_A + r'_B$ be the allele-count out of these $n_{AB}$ lineages. Also,
let $n'_{AB}$ be the number of lineages ancestral to those $n_{AB}$ lineages
that are present at the top node (root) of the branch AB, and let $r'_{AB}$ be
the allele-count out of these $n'_{AB}$ lineages. As before, let
$n_0 = n'_{AB} + n'_D$ be the total number of lineages at the root ancestral to
the subsamples at $z_A$, $z_B$ and $z_D$; let $r_0 = r'_{AB} + r'_D$ be the
allele-count out of these $n_0$ lineages. Note that
$n_{AB} = n'_A + n'_B = 2$ and $n'_{AB} \leq n_{AB}$. From Eq. (5) and the fact
that $r_{AB} = r'_A + r'_B$ it follows that

\[
(r_A, r_B) = (0, 0) \iff (r'_A, r'_B) = (0, 0) \iff r_{AB} = 0.
\]

Thus,

\[
\begin{aligned}
&\Pr\big((r_A, r_B, r_D) = (0, 0, 1) \mid n_A = n_B = n_D = 1;\ \tau_{AB}, \theta\big) \\
&\quad = \Pr\big((r_{AB}, r_D) = (0, 1) \mid (n_{AB}, n_D) = (2, 1);\ \tau_{AB}, \theta\big).
\end{aligned} \tag{11}
\]

Consider the part of the subtree consisting of the paths from $z_{AB}$ and
$z_D$ to the root; it is a Type-I subtree with $z_{AB}$ and $z_D$ as the tips,
and $\tau_{AB}$ and $\tau_D$, respectively, as the lengths of the attached
branches; it has $(n_{AB}, n_D) = (2, 1)$ as the numbers of observed lineages
at $z_{AB}$ and $z_D$, and $(r_{AB}, r_D)$ as the allele-counts in these
lineages. From Eqs. (9) and (11),

\[
\begin{aligned}
\tau_{AB} &= h\Big( \Pr\big((r_{AB}, r_D) = (0, 1) \mid (n_{AB}, n_D) = (2, 1);\ \tau_{AB}, \theta\big),\ \theta \Big) \\
&= h\Big( \Pr\big((r_A, r_B, r_D) = (0, 0, 1) \mid n_A = n_B = n_D = 1;\ \tau_{AB}, \theta\big),\ \theta \Big).
\end{aligned}
\]

As we have already established that $\tau_A + \tau_{AB}$, $\tau_B + \tau_{AB}$,
$\tau_D$, $\theta$ and the left side of Eq. (10) can be expressed as functions
of $f_{N,R}$, it follows that $\tau_A$, $\tau_B$, $\tau_{AB}$, $\tau_D$ and
$\theta$ can be expressed as functions of $f_{N,R}$. Thus, they are
identifiable and this proposition is proven.
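The last step of the argument is elementary arithmetic: once the two path sums and the length of branch AB are known, the remaining two branch lengths follow by subtraction. A minimal sketch, with a hypothetical function name and hypothetical input values:

```python
from typing import Tuple


def type2_branch_lengths(
    sum_a_ab: float, sum_b_ab: float, tau_ab: float
) -> Tuple[float, float]:
    """Recover (tau_A, tau_B) from tau_A + tau_AB, tau_B + tau_AB and tau_AB."""
    tau_a = sum_a_ab - tau_ab
    tau_b = sum_b_ab - tau_ab
    if min(tau_a, tau_b, tau_ab) < 0.0:
        raise ValueError("inputs imply a negative branch length")
    return tau_a, tau_b


# Hypothetical values: tau_A + tau_AB = 1.2, tau_B + tau_AB = 0.9, tau_AB = 0.5.
tau_a, tau_b = type2_branch_lengths(1.2, 0.9, 0.5)
```

The negativity check is only a sanity guard for inputs that are inconsistent with a tree.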

Thus, the parameters of the tree are identifiable, as each two-tip subtree
along with the root distribution parameter $\theta$ is identifiable.

4 Discussions 

We have proven that the model parameters are identifiable under the
coalescent-based population tree model of [2, 5]. Thus, the problem of
estimating the population tree from this model is indeed meaningfully stated.
Moreover, as identifiability is a required condition for consistency of the
maximum likelihood estimator (MLE), this is a step towards proving the
consistency of the MLE for this model. We have proven the identifiability of
the tree parameters for any identifiable root distribution. As a result, our
proof is valid for different versions of this model (that vary at the root
distribution) such as [H O |6] .